Anomaly Detection

In this notebook, we’ll implement anomaly detection in time series data

This tutorial assumes basic familiarity with StatsForecast. For a minimal example visit the Quick Start

Introduction

Anomaly detection is a crucial task in time series forecasting. It involves identifying unusual observations that don’t follow the expected dataset patterns. Anomalies, also known as outliers, can be caused by a variety of factors, such as errors in the data collection process, sudden changes in the underlying patterns of the data, or unexpected events. They can pose problems for many forecasting models since they can distort trends, seasonal patterns, or autocorrelation estimates. As a result, anomalies can have a significant impact on the accuracy of the forecasts, and for this reason, it is essential to be able to identify them. Furthermore, anomaly detection has many applications across different industries, such as detecting fraud in financial data, monitoring the performance of online services, or identifying usual patterns in energy usage.

By the end of this tutorial, you’ll have a good understanding of how to detect anomalies in time series data using StatsForecast’s probabilistic models.

Outline:

  1. Install libraries
  2. Load and explore data
  3. Train model
  4. Recover insample forecasts and identify anomalies
Important

Once an anomaly has been identified, we must decide what to do with it. For example, we could remove it or replace it with another value. The correct course of action is context-dependent and beyond this notebook’s scope. Removing an anomaly will likely improve the accuracy of the forecast, but it can also underestimate the amount of randomness in the data.

Tip

You can use Colab to run this Notebook interactively Open In Colab

Install libraries

We assume that you have StatsForecast already installed. If not, check this guide for instructions on how to install StatsForecast

Install the necessary packages using pip install statsforecast

pip install statsforecast -U

Load and explore the data

For this example, we’ll use the hourly dataset of the M4 Competition. We’ll first import the data from datasetsforecast, which you can install using pip install datasetsforecast

pip install datasetsforecast -U
from datasetsforecast.m4 import M4

The function to load the data is M4.load. It requieres the following two arguments:

  • directory: (str) The directory where the data will be downloaded.
  • group: (str). The group name, which can be Yearly, Quarterly, Monthly, Weekly, Daily or Hourly.

This function returns multiple outputs, but only the first one with the target series is needed.

df_total, *_ = M4.load('./data', 'Hourly')
df_total.head()
unique_id ds y
0 H1 1 605.0
1 H1 2 586.0
2 H1 3 586.0
3 H1 4 559.0
4 H1 5 511.0

The input to StatsForecast is always a data frame in long format with three columns: unique_id, df and y.

  • unique_id: (string, int or category) A unique identifier for the series.
  • ds: (datestamp or int) A datestamp in format YYYY-MM-DD or YYYY-MM-DD HH:MM:SS or an integer indexing time.
  • y: (numeric) The measurement we wish to forecast.

In this case, the unique_id and y columns already have the requiered format, but we need to change the data type of the ds column.

df_total['ds'] = df_total['ds'].astype(int)

From this dataset, we’ll select the first 8 time series to reduce the total execution time. You can select any number you want by changing the value of n_series.

n_series = 8 
uids = df_total['unique_id'].unique()[:n_series]
df = df_total.query('unique_id in @uids')

We can plot these series using the plot method from the StatsForecast class. This method has multiple parameters, and the required ones to generate the plots in this notebook are explained below.

  • df: A pandas dataframe with columns [unique_id, ds, y].
  • forecasts_df: A pandas dataframe with columns [unique_id, ds] and models.
  • unique_ids: (list[str]) A list with the ids of the time series we want to plot.
  • plot_random: (bool = True) Plots the time series randomly.
  • plot_anomalies: (bool = False) Plots anomalies for each prediction interval.
  • engine: (str = plotly). The library used to generate the plots. It can also be matplotlib for static plots.
from statsforecast import StatsForecast
StatsForecast.plot(df, plot_random = False)

Train model

To generate the forecast, we’ll use the MSTL model, which is well-suited for low-frequency data like the one used here. We first need to import it from statsforecast.models and then we need to instantiate it. Since we’re using hourly data, we have two seasonal periods: one every 24 hours (hourly) and one every 24*7 hours (daily). Hence, we need to set season_length = [24, 24*7].

from statsforecast.models import MSTL

# Create a list of models and instantiation parameters 
models = [MSTL(season_length = [24, 24*7])]

To instantiate a new StatsForecast object, we need the following parameters:

  • df: The dataframe with the training data.
  • models: The list of models defined in the previous step.
  • freq: A string indicating the frequency of the data. See pandas’ available frequencies.
  • n_jobs: An integer that indicates the number of jobs used in parallel processing. Use -1 to select all cores.
sf = StatsForecast(
    df = df, 
    models = models, 
    freq = 'H', 
    n_jobs = -1
)

We’ll now predict the next 48 hours. To do this, we’ll use the forecast method, which requieres the following arguments:

  • h: (int) The forecasting horizon.
  • level: (list[float]) The confidence levels of the prediction intervals
  • fitted: (bool = False) Returns insample predictions.

It is important that we select a level and set fitted = True since we’ll need the insample forecasts and their prediction intervals to detect the anomalies.

horizon = 48
levels = [99] 

fcst = sf.forecast(h = 48, level = levels, fitted = True)
fcst = fcst.reset_index()
fcst.head()
unique_id ds MSTL MSTL-lo-99 MSTL-hi-99
0 H1 749 615.943970 597.662170 634.225708
1 H1 750 559.297791 531.316650 587.278931
2 H1 751 515.693542 479.151337 552.235718
3 H1 752 480.719269 436.241547 525.197021
4 H1 753 467.146484 415.199738 519.093262

We can plot the forecasts using the plot method from before.

StatsForecast.plot(df, fcst, plot_random = False)

Recover insample forecasts and identify anomalies

In this example, an anomaly will be any observation outside the prediction interval of the insample forecasts for a given confidence level (here we selected 99%). Hence, we first need to recover the insample forecasts using the forecast_fitted_values method.

insample_forecasts = sf.forecast_fitted_values().reset_index()
insample_forecasts.head()
unique_id ds y MSTL MSTL-lo-99 MSTL-hi-99
0 H1 1 605.0 604.924500 588.010376 621.838623
1 H1 2 586.0 585.221802 568.307678 602.135925
2 H1 3 586.0 589.740723 572.826599 606.654846
3 H1 4 559.0 557.778076 540.863953 574.692200
4 H1 5 511.0 506.747009 489.832886 523.661133

We can now find all the observations above or below the 99% prediction interval for the insample forecasts.

anomalies = insample_forecasts.loc[(insample_forecasts['y'] >= insample_forecasts['MSTL-hi-99']) | (insample_forecasts['y'] <= insample_forecasts['MSTL-lo-99'])]
anomalies.head()
unique_id ds y MSTL MSTL-lo-99 MSTL-hi-99
168 H1 169 813.0 779.849792 762.935669 796.763916
279 H1 280 692.0 672.638123 655.723999 689.552246
289 H1 290 770.0 792.015442 775.101318 808.929565
308 H1 309 844.0 867.809387 850.895203 884.723511
336 H1 337 853.0 822.427002 805.512878 839.341187

We can plot the anomalies by adding the plot_anomalies = True argument to the plot method.

StatsForecast.plot(insample_forecasts, plot_random = False, plot_anomalies = True)

If we want to take a closer look, we can use the unique_ids argument to select one particular time series, for example, H10.

StatsForecast.plot(insample_forecasts, unique_ids = ['H10'], plot_anomalies = True)

Here we identified the anomalies in the data using the MSTL model, but any probabilistic model from StatsForecast can be used. We also selected the 99% prediction interval of the insample forecasts, but other confidence levels can be used as well.

Give us a ⭐ on Github